2023MCS120003_miniproject

Sumod Sethumadhavan

17/11/2023

Project Report: Analysis of Air Quality and Climatic Trends in India

Introduction

In this study, we investigate the interplay between air quality and climate in India, focusing on key pollutants and meteorological patterns. Our analysis, grounded in comprehensive datasets spanning several years and cities, aims to elucidate the distribution of air pollutants, such as ozone, particulate matter, and nitrogen dioxide, and their correlation with climatic variables like temperature and rainfall. Through this exploration, we seek to understand the dynamics of environmental quality and its implications for health and well-being. The project’s core objective is to provide a nuanced understanding of environmental conditions in India, offering insights valuable for policy development and environmental management.

Objectives

The objectives of this mini-project encompass a comprehensive analysis of air quality, climatic trends, and rainfall prediction across various cities in India. This endeavor is multi-faceted, aiming not only to assess and compare current conditions but also to predict future trends. The objectives can be broadly categorized as follows:

Analyzing Trends of Data

  • Air Quality Index (AQI) Analysis:
    • Assess the variation of AQI across different cities, seasons, and regions in India.
    • Examine the annual variation in pollutant concentrations in select cities to identify trends and outliers.
  • Temperature Analysis:
    • Investigate the variation of average, minimum, and maximum temperatures across cities and different elevations.
    • Compare temperature data among cities to identify similar patterns or unique trends.
  • Precipitation Analysis:
    • Explore the variation in precipitation across cities and different elevations.
    • Analyze annual or monthly precipitation levels and compare them across the cities.
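
The monthly precipitation comparison could be sketched as follows. This is a minimal base R illustration on an invented data frame (`demo` and its values are hypothetical); only the column names `time` and `prcp` follow the weather files used later in this report.

```r
# Hypothetical daily precipitation records (values invented for illustration)
demo <- data.frame(
  time = as.Date(c("2019-06-01", "2019-06-15", "2019-07-03", "2020-06-10")),
  prcp = c(5.0, 12.5, 30.0, 8.0)
)

# Derive year and month keys from the date
demo$year  <- format(demo$time, "%Y")
demo$month <- format(demo$time, "%m")

# Total precipitation per year and month
monthly <- aggregate(prcp ~ year + month, data = demo, FUN = sum)
```

Running the same aggregation per city would give directly comparable monthly totals.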

Comparative Analysis

  • Temperature Data Comparison:
    • Use statistical tools and visualizations (like boxplots) to compare temperature distributions across cities.
  • Precipitation Data Comparison:
    • Employ bar charts or line graphs to compare precipitation levels and identify wetter and drier cities.
  • Seasonal Variations:
    • Examine seasonal changes in temperature and precipitation and compare these across cities.
  • Extreme Weather Events:
    • Identify and analyze unusual climatic events, discussing potential causes such as climate change.
  • Visualizations and Statistical Analysis:
    • Leverage a variety of plots and statistical tests to effectively convey and validate the comparative analysis.
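
As a minimal sketch of the boxplot comparison listed above, the following uses base R graphics on synthetic data. The `weather_demo` data frame and its values are invented for illustration; only the column name `tavg` follows the weather files used later in this report.

```r
# Synthetic stand-in for per-city daily average temperatures (illustrative only)
set.seed(1)
weather_demo <- data.frame(
  city = rep(c("Bangalore", "Delhi", "Mumbai"), each = 50),
  tavg = c(rnorm(50, 24, 2), rnorm(50, 25, 7), rnorm(50, 27, 2))
)

# One box per city; boxplot() also returns the statistics it draws
bp <- boxplot(tavg ~ city, data = weather_demo,
              main = "Average temperature by city (synthetic data)",
              xlab = "City", ylab = "tavg (deg C)")

# Interquartile range per city, a simple numeric companion to the plot
iqr_by_city <- tapply(weather_demo$tavg, weather_demo$city, IQR)
```

A wider box or a larger interquartile range indicates a city with stronger temperature variability.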

Overview of Datasets for Data Exploration

Air Quality Dataset

| Dataset Name | Description | Time Period | Coverage | Frequency | Parameters |
|---|---|---|---|---|---|
| City Level Data | Data for major cities in India | 2015 to 2020 | 26 cities in India | Daily and hourly | PM2.5, PM10, NO2, SO2, CO, O3, AQI |
| Station Level Data | Localized air quality measurements at various stations within cities | - | Multiple stations within cities | Hourly and daily | Similar to city level data |

Weather Dataset

| Dataset Name | Description | Time Period | Coverage | Frequency | Parameters |
|---|---|---|---|---|---|
| General Weather Data | Weather data for major Indian cities | 1990 to 2022 | 8 major cities in India | Daily | Min, max, average temperatures, precipitation |

Specific Weather Dataset Files

| File Name | Description | Time Period | Coverage | Frequency | Parameters |
|---|---|---|---|---|---|
| weather_Rourkela_2021_2022.csv | Weather data for Rourkela | 2021 to 2022 | Rourkela | Daily | Temperature, precipitation |
| weather_Bhubhneshwar_1990_2022.csv | Weather data for Bhubaneswar | 1990 to 2022 | Bhubaneswar | Hourly | Temperature, precipitation |
| Rajasthan_1990_2022_Jodhpur.csv | Weather data for Jodhpur | 1990 to 2022 | Jodhpur | Daily | Temperature, precipitation |
| Mumbai_1990_2022_Santacruz.csv | Weather data for Santacruz (Mumbai) | 1990 to 2022 | Santacruz, Mumbai | Daily | Temperature, precipitation |
| Lucknow_1990_2022.csv | Weather data for Lucknow | 1990 to 2022 | Lucknow | Hourly | Temperature, precipitation |
| Delhi_NCR_1990_2022_Safdarjung.csv | Weather data for Safdarjung (Delhi) | 1990 to 2022 | Safdarjung, Delhi | Daily | Temperature, precipitation |
| Chennai_1990_2022_Madras.csv | Weather data for Chennai | 1990 to 2022 | Chennai | Daily | Temperature, precipitation |
| Bangalore_1990_2022_BangaloreCity.csv | Weather data for Bangalore | 1990 to 2022 | Bangalore | Hourly | Temperature, precipitation |
| Station_GeoLocation_Longitute_Latitude_Elevation_EPSG_4326.csv | Geographical characteristics of stations | - | Stations in various cities | - | Longitude, latitude, elevation |

External Data Source

  • Indian Cities Dataset from Kaggle (@indian-cities-kaggle): Provides latitude and longitude information for cities in the primary dataset.

Predictive Modeling

  • AQI Prediction:
    • Develop a predictive model to forecast the AQI of a city based on historical air quality and weather data.
    • Utilize relevant machine learning techniques to enhance the accuracy and reliability of the predictions.
  • Rainfall Prediction:
    • Create a model to predict rainfall patterns and intensity in different cities, using historical weather data.
    • Apply advanced forecasting methods to provide accurate and timely rainfall predictions.
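
To make the AQI prediction objective concrete, here is a minimal baseline sketch. It assumes a plain linear regression, which is swapped in purely for illustration since the report does not name a specific model, and it runs on synthetic data. The data frame `demo`, its columns, and the coefficients used to generate `aqi` are all invented.

```r
# Synthetic predictors and response (all values invented for illustration)
set.seed(42)
n <- 200
demo <- data.frame(
  pm25 = runif(n, 10, 200),
  tavg = runif(n, 10, 35),
  prcp = rexp(n, rate = 0.2)
)
demo$aqi <- 20 + 1.2 * demo$pm25 - 0.5 * demo$prcp + rnorm(n, sd = 10)

# Hold out the last 50 rows for evaluation
train <- demo[1:150, ]
test  <- demo[151:n, ]

# Baseline linear model on pollutant and weather predictors
fit  <- lm(aqi ~ pm25 + tavg + prcp, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test$aqi - pred)^2))
```

A baseline of this kind mainly gives a reference error against which more elaborate models can be judged.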

Reporting and Further Research

  • Summarization and Insight Generation:
    • Present findings in a clear, concise manner, highlighting key insights and unexpected results.
  • Recommendations for Future Work:
    • Suggest areas for further research and discuss the implications of findings for policy-making, urban planning, agriculture, public health, and other relevant sectors.

Together, these objectives aim to provide a holistic understanding of air quality, climatic conditions, and rainfall patterns in India, drawing on data analysis, comparative studies, and predictive modeling. The resulting insights are intended to inform environmental policies, urban planning, and public health strategies.




Preliminary Analysis

In this section, we delve into the preliminary analysis of both the climate and air quality datasets. This includes our initial findings, data cleaning steps, and basic data explorations. The analysis begins with climate data, followed by air quality data, to provide a comprehensive overview.

Climate Data Analysis

  • Initial Observations:
    • Briefly describe the initial observations made from the climate dataset. This may include patterns, anomalies, or general trends noted in temperature and precipitation data across different cities.
  • Data Cleaning Steps:
    • Detail the steps taken to clean and preprocess the climate data. This might include handling missing values, correcting anomalies, normalizing data, or any other transformations applied.
    • Explain the rationale behind each step and how it aids in ensuring data accuracy and reliability.
  • Basic Explorations:
    • Present the initial explorations conducted on the climate data. This could involve:
      • Descriptive statistics to understand the distribution of temperature and precipitation.
      • Simple visualizations like line graphs or histograms to illustrate basic trends and patterns.
      • Comparison of temperature and precipitation data across different cities and time frames.

Air Quality Data Analysis

  • Initial Observations:
    • Summarize the early findings from the air quality dataset. Highlight the noticeable trends in AQI and pollutant concentrations across different cities and periods.
  • Data Cleaning Steps:
    • Discuss the procedures implemented to clean the air quality dataset. This may include filtering out irrelevant data, handling outliers, smoothing noisy data, etc.
    • Justify these steps and their importance in ensuring data quality.
  • Basic Explorations:
    • Share the preliminary analysis conducted on the air quality data. This could include:
      • Descriptive analysis of AQI and pollutants across different cities and times.
      • Initial graphical representations, such as bar charts or scatter plots, to show AQI trends and pollutant levels.
      • Early comparison of air quality across different cities and environmental conditions.

This preliminary analysis sets the stage for more in-depth investigations into both climate and air quality datasets, laying the groundwork for further statistical analysis, comparative studies, and predictive modeling.

Experiments with Data

# Loading the datasets

# Bangalore
bangalore_df <- read.csv("d1/Bangalore_1990_2022_BangaloreCity.csv")

# Chennai
chennai_df <- read.csv("d1/Chennai_1990_2022_Madras.csv")

# Delhi
delhi_df <- read.csv("d1/Delhi_NCR_1990_2022_Safdarjung.csv")

# Lucknow
lucknow_df <- read.csv("d1/Lucknow_1990_2022.csv")

# Mumbai
mumbai_df <- read.csv("d1/Mumbai_1990_2022_Santacruz.csv")

# Rajasthan (Jodhpur)
rajasthan_df <- read.csv("d1/Rajasthan_1990_2022_Jodhpur.csv")

# Bhubaneswar
bhubaneswar_df <- read.csv("d1/weather_Bhubhneshwar_1990_2022.csv")

# Rourkela
rourkela_df <- read.csv("d1/weather_Rourkela_2021_2022.csv")

# Load the Station GeoLocation data
station_geo_df <- read.csv("d1/Station_GeoLocation_Longitute_Latitude_Elevation_EPSG_4326.csv")
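
The eight per-city reads above could also be written as a single loop. The sketch below is one possible arrangement, not the approach used in this report: the names `weather_files`, `weather_paths`, and `weather_list` are hypothetical, while the file names and the `d1` directory are copied from the calls above.

```r
# File names exactly as loaded above, keyed by a short city label
weather_files <- c(
  bangalore   = "Bangalore_1990_2022_BangaloreCity.csv",
  chennai     = "Chennai_1990_2022_Madras.csv",
  delhi       = "Delhi_NCR_1990_2022_Safdarjung.csv",
  lucknow     = "Lucknow_1990_2022.csv",
  mumbai      = "Mumbai_1990_2022_Santacruz.csv",
  rajasthan   = "Rajasthan_1990_2022_Jodhpur.csv",
  bhubaneswar = "weather_Bhubhneshwar_1990_2022.csv",
  rourkela    = "weather_Rourkela_2021_2022.csv"
)

# Build full paths and carry the city labels over
weather_paths <- file.path("d1", weather_files)
names(weather_paths) <- names(weather_files)

# Read only the files that are present, so the sketch runs even without the data
present <- weather_paths[file.exists(weather_paths)]
weather_list <- lapply(present, read.csv)
```

With the data in place, `weather_list$delhi` would hold the same data frame as `delhi_df`.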

# Preprocessing the Bangalore dataset
library(dplyr)
# Convert date column to Date format
bangalore_df$time <- as.Date(bangalore_df$time, format = "%d-%m-%Y")

# Filter data for 2015-2020
bangalore_df <- bangalore_df %>%
                filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")


#chennai
chennai_df$time <- as.Date(chennai_df$time, format = "%d-%m-%Y")
chennai_df <- chennai_df %>%
              filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")

#Delhi
delhi_df$time <- as.Date(delhi_df$time, format = "%d-%m-%Y")
delhi_df <- delhi_df %>%
            filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")

#Lucknow
lucknow_df$time <- as.Date(lucknow_df$time, format = "%d-%m-%Y")
lucknow_df <- lucknow_df %>%
              filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")

# Mumbai
mumbai_df$time <- as.Date(mumbai_df$time, format = "%d-%m-%Y")
mumbai_df <- mumbai_df %>%
              filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")

#rajasthan
rajasthan_df$time <- as.Date(rajasthan_df$time, format = "%d-%m-%Y")
rajasthan_df <- rajasthan_df %>%
                filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")

# Bhubaneswar
bhubaneswar_df$time <- as.Date(bhubaneswar_df$time, format = "%d-%m-%Y")
bhubaneswar_df <- bhubaneswar_df %>%
                  filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")

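# Rourkela
# Note: per the dataset overview, the Rourkela file covers 2021 to 2022,
# so the 2015-2020 filter below leaves this data frame with no rows.
rourkela_df$time <- as.Date(rourkela_df$time, format = "%d-%m-%Y")
rourkela_df <- rourkela_df %>%
               filter(format(time, "%Y") >= "2015" & format(time, "%Y") <= "2020")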
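
The date parsing and year filter repeated above can be captured in one helper. This is a sketch under assumptions: `filter_years` is a hypothetical name, it compares years as integers instead of strings, and it adds a `year` column that the code above does not create.

```r
library(dplyr)

# Hypothetical helper: parse "dd-mm-YYYY" dates and keep rows within a year range
filter_years <- function(df, from = 2015, to = 2020) {
  df %>%
    mutate(time = as.Date(time, format = "%d-%m-%Y"),
           year = as.integer(format(time, "%Y"))) %>%
    filter(year >= from, year <= to)
}

# Small invented example: two of the four rows fall inside 2015-2020
demo <- data.frame(time = c("31-12-2014", "01-01-2015", "15-06-2018", "01-01-2021"),
                   tavg = c(20.1, 21.3, 25.0, 19.8))
kept <- filter_years(demo)

# A frame that lies entirely outside the range comes back empty
empty <- filter_years(data.frame(time = "05-03-2021", tavg = 30))
```

Checking `nrow()` after filtering is a quick way to catch a city whose records fall outside the chosen window.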

# Handling missing values in Bangalore dataset
bangalore_df <- bangalore_df %>%
                mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                       tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                       tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                       prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

# Chennai
chennai_df <- chennai_df %>%
              mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                     tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                     tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                     prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

#Delhi

delhi_df <- delhi_df %>%
            mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                   tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                   tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                   prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

# Lucknow
lucknow_df <- lucknow_df %>%
              mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                     tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                     tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                     prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

#Mumbai Dataset
mumbai_df <- mumbai_df %>%
             mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                    tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                    tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                    prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

#rajasthan
rajasthan_df <- rajasthan_df %>%
                mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                       tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                       tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                       prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

# Bhubaneswar
bhubaneswar_df <- bhubaneswar_df %>%
                  mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                         tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                         tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                         prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))

#rourkela
rourkela_df <- rourkela_df %>%
               mutate(tavg = ifelse(is.na(tavg), mean(tavg, na.rm = TRUE), tavg),
                      tmin = ifelse(is.na(tmin), mean(tmin, na.rm = TRUE), tmin),
                      tmax = ifelse(is.na(tmax), mean(tmax, na.rm = TRUE), tmax),
                      prcp = ifelse(is.na(prcp), mean(prcp, na.rm = TRUE), prcp))
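
The same mean imputation can be expressed once for all four columns with `across()`. The sketch below uses a small invented data frame (`demo`) and also reports the share of missing values per column, which is worth checking before imputing; neither step is claimed to be part of the original analysis.

```r
library(dplyr)

# Invented example with one missing value per column
demo <- data.frame(tavg = c(24, NA, 26), tmin = c(18, 19, NA),
                   tmax = c(30, NA, 32), prcp = c(0, NA, 6))

# Share of missing values per column
na_share <- colMeans(is.na(demo))

# Replace each NA with the mean of the non-missing values in its column
imputed <- demo %>%
  mutate(across(c(tavg, tmin, tmax, prcp),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
```

When `na_share` is large for a column, mean imputation pulls that column's median and spread toward the mean, so the summary statistics should be read with that in mind.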

Summary Statistics:

# Example for Bangalore
summary_stats_bangalore <- bangalore_df %>%
                           summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                     Median_Tavg = median(tavg, na.rm = TRUE),
                                     SD_Tavg = sd(tavg, na.rm = TRUE),
                                     Mean_Prcp = mean(prcp, na.rm = TRUE),
                                     Median_Prcp = median(prcp, na.rm = TRUE),
                                     SD_Prcp = sd(prcp, na.rm = TRUE))

summary_stats_bangalore
##   Mean_Tavg Median_Tavg  SD_Tavg Mean_Prcp Median_Prcp  SD_Prcp
## 1  24.18695        23.8 2.226738  5.930541    5.930541 8.966999
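# Note: Median_Prcp equals Mean_Prcp here, which suggests that mean-imputed
# values make up a large share of the Bangalore prcp column.

# Chennai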

summary_stats_chennai <- chennai_df %>%
                         summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                   Median_Tavg = median(tavg, na.rm = TRUE),
                                   SD_Tavg = sd(tavg, na.rm = TRUE),
                                   Mean_Prcp = mean(prcp, na.rm = TRUE),
                                   Median_Prcp = median(prcp, na.rm = TRUE),
                                   SD_Prcp = sd(prcp, na.rm = TRUE))

# Delhi
summary_stats_delhi <- delhi_df %>%
                       summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                 Median_Tavg = median(tavg, na.rm = TRUE),
                                 SD_Tavg = sd(tavg, na.rm = TRUE),
                                 Mean_Prcp = mean(prcp, na.rm = TRUE),
                                 Median_Prcp = median(prcp, na.rm = TRUE),
                                 SD_Prcp = sd(prcp, na.rm = TRUE))

#Lucknow
summary_stats_lucknow <- lucknow_df %>%
                         summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                   Median_Tavg = median(tavg, na.rm = TRUE),
                                   SD_Tavg = sd(tavg, na.rm = TRUE),
                                   Mean_Prcp = mean(prcp, na.rm = TRUE),
                                   Median_Prcp = median(prcp, na.rm = TRUE),
                                   SD_Prcp = sd(prcp, na.rm = TRUE))
#Mumbai
summary_stats_mumbai <- mumbai_df %>%
                        summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                  Median_Tavg = median(tavg, na.rm = TRUE),
                                  SD_Tavg = sd(tavg, na.rm = TRUE),
                                  Mean_Prcp = mean(prcp, na.rm = TRUE),
                                  Median_Prcp = median(prcp, na.rm = TRUE),
                                  SD_Prcp = sd(prcp, na.rm = TRUE))

#Rajasthan
summary_stats_rajasthan <- rajasthan_df %>%
                           summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                     Median_Tavg = median(tavg, na.rm = TRUE),
                                     SD_Tavg = sd(tavg, na.rm = TRUE),
                                     Mean_Prcp = mean(prcp, na.rm = TRUE),
                                     Median_Prcp = median(prcp, na.rm = TRUE),
                                     SD_Prcp = sd(prcp, na.rm = TRUE))

#Bhubaneshwar
summary_stats_bhubaneswar <- bhubaneswar_df %>%
                             summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                       Median_Tavg = median(tavg, na.rm = TRUE),
                                       SD_Tavg = sd(tavg, na.rm = TRUE),
                                       Mean_Prcp = mean(prcp, na.rm = TRUE),
                                       Median_Prcp = median(prcp, na.rm = TRUE),
                                       SD_Prcp = sd(prcp, na.rm = TRUE))
#Rourkela
summary_stats_rourkela <- rourkela_df %>%
                          summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
                                    Median_Tavg = median(tavg, na.rm = TRUE),
                                    SD_Tavg = sd(tavg, na.rm = TRUE),
                                    Mean_Prcp = mean(prcp, na.rm = TRUE),
                                    Median_Prcp = median(prcp, na.rm = TRUE),
                                    SD_Prcp = sd(prcp, na.rm = TRUE))
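
The per-city summaries above can also be collected into a single table, which makes the cities easier to compare side by side. The following is a sketch on two invented data frames (`bangalore_demo`, `delhi_demo`); with the real data, the cleaned per-city data frames would take their place.

```r
library(dplyr)

# Invented stand-ins for two cleaned city data frames
bangalore_demo <- data.frame(tavg = c(23, 24, 25), prcp = c(0, 2, 10))
delhi_demo     <- data.frame(tavg = c(15, 25, 35), prcp = c(0, 0, 3))

# Stack the frames with a city label, then summarise once per city
city_stats <- bind_rows(Bangalore = bangalore_demo, Delhi = delhi_demo, .id = "city") %>%
  group_by(city) %>%
  summarise(Mean_Tavg   = mean(tavg, na.rm = TRUE),
            Median_Tavg = median(tavg, na.rm = TRUE),
            SD_Tavg     = sd(tavg, na.rm = TRUE),
            Mean_Prcp   = mean(prcp, na.rm = TRUE),
            Median_Prcp = median(prcp, na.rm = TRUE),
            SD_Prcp     = sd(prcp, na.rm = TRUE))
```

The result has one row per city and the same six statistics as the individual summaries.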

Exploring the Air Quality across cities


Stations

# Preliminary exploration of the stations metadata
# Convert StationId, City, State and Status into factors
library(forcats)  # provides as_factor()
aqi_stnts$StationId <- as_factor(aqi_stnts$StationId)
aqi_stnts$City <- as_factor(aqi_stnts$City)
aqi_stnts$State <- as_factor(aqi_stnts$State)
aqi_stnts$Status <- as_factor(aqi_stnts$Status)

str(aqi_stnts$StationId)
##  Factor w/ 230 levels "AP001","AP002",..: 1 2 3 4 5 6 7 8 9 10 ...
str(aqi_stnts$City)
##  Factor w/ 127 levels "Amaravati","Rajamahendravaram",..: 1 2 3 4 5 6 7 7 8 9 ...
str(aqi_stnts$State)
##  Factor w/ 21 levels "Andhra Pradesh",..: 1 1 1 1 1 2 3 3 3 3 ...
aqi_stnts %>% nrow()
## [1] 230
aqi_stnts %>% summary()
##    StationId   StationName               City                State   
##  AP001  :  1   Length:230         Delhi    : 38   Delhi         :38  
##  AP002  :  1   Class :character   Bengaluru: 10   Haryana       :29  
##  AP003  :  1   Mode  :character   Mumbai   : 10   Uttar Pradesh :26  
##  AP004  :  1                      Kolkata  :  7   Maharashtra   :22  
##  AP005  :  1                      Patna    :  6   Karnataka     :20  
##  AS001  :  1                      Hyderabad:  6   Madhya Pradesh:16  
##  (Other):224                      (Other)  :153   (Other)       :79  
##       Status   
##  Active  :131  
##  Inactive:  2  
##  NA's    : 97  
##                
##                
##                
## 

City by Day

# Preliminary exploration of city_day data
# Convert the city and AQI_Bucket into factors
aqi_city_day$City <-as_factor(aqi_city_day$City)
aqi_city_day$AQI_Bucket <-as_factor(aqi_city_day$AQI_Bucket)
aqi_city_day%>%nrow()
## [1] 29531
str(aqi_city_day$City)
##  Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_city_day$AQI_Bucket)
##  Factor w/ 6 levels "Poor","Very Poor",..: NA NA NA NA NA NA NA NA NA NA ...
aqi_city_day%>%summary()
##         City            Date                PM2.5             PM10        
##  Ahmedabad: 2009   Min.   :2015-01-01   Min.   :  0.04   Min.   :   0.01  
##  Bengaluru: 2009   1st Qu.:2017-04-16   1st Qu.: 28.82   1st Qu.:  56.26  
##  Chennai  : 2009   Median :2018-08-05   Median : 48.57   Median :  95.68  
##  Delhi    : 2009   Mean   :2018-05-14   Mean   : 67.45   Mean   : 118.13  
##  Lucknow  : 2009   3rd Qu.:2019-09-03   3rd Qu.: 80.59   3rd Qu.: 149.75  
##  Mumbai   : 2009   Max.   :2020-07-01   Max.   :949.99   Max.   :1000.00  
##  (Other)  :17477                        NA's   :4598     NA's   :11140    
##        NO              NO2              NOx              NH3        
##  Min.   :  0.02   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  5.63   1st Qu.: 11.75   1st Qu.: 12.82   1st Qu.:  8.58  
##  Median :  9.89   Median : 21.69   Median : 23.52   Median : 15.85  
##  Mean   : 17.57   Mean   : 28.56   Mean   : 32.31   Mean   : 23.48  
##  3rd Qu.: 19.95   3rd Qu.: 37.62   3rd Qu.: 40.13   3rd Qu.: 30.02  
##  Max.   :390.68   Max.   :362.21   Max.   :467.63   Max.   :352.89  
##  NA's   :3582     NA's   :3585     NA's   :4185     NA's   :10328   
##        CO               SO2               O3            Benzene       
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.000  
##  1st Qu.:  0.510   1st Qu.:  5.67   1st Qu.: 18.86   1st Qu.:  0.120  
##  Median :  0.890   Median :  9.16   Median : 30.84   Median :  1.070  
##  Mean   :  2.249   Mean   : 14.53   Mean   : 34.49   Mean   :  3.281  
##  3rd Qu.:  1.450   3rd Qu.: 15.22   3rd Qu.: 45.57   3rd Qu.:  3.080  
##  Max.   :175.810   Max.   :193.86   Max.   :257.73   Max.   :455.030  
##  NA's   :2059      NA's   :3854     NA's   :4022     NA's   :5623     
##     Toluene            Xylene            AQI                AQI_Bucket  
##  Min.   :  0.000   Min.   :  0.00   Min.   :  13.0   Poor        :2781  
##  1st Qu.:  0.600   1st Qu.:  0.14   1st Qu.:  81.0   Very Poor   :2337  
##  Median :  2.970   Median :  0.98   Median : 118.0   Severe      :1338  
##  Mean   :  8.701   Mean   :  3.07   Mean   : 166.5   Moderate    :8829  
##  3rd Qu.:  9.150   3rd Qu.:  3.35   3rd Qu.: 208.0   Satisfactory:8224  
##  Max.   :454.850   Max.   :170.37   Max.   :2049.0   Good        :1341  
##  NA's   :8041      NA's   :18109    NA's   :4681     NA's        :4681
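
The `str()` output above shows the `AQI_Bucket` levels in order of first appearance ("Poor", "Very Poor", ...), which is how `as_factor()` assigns them. If the buckets should instead follow the severity order of the Indian AQI scale, an ordered factor with explicit levels is one option. The sketch below uses an invented vector (`demo_bucket`) and base R only.

```r
# AQI categories from best to worst
bucket_levels <- c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe")

# Invented sample of bucket labels, including a missing value
demo_bucket <- c("Poor", "Very Poor", NA, "Good", "Moderate",
                 "Severe", "Satisfactory", "Poor")

# Ordered factor: levels follow severity, not order of appearance
bucket <- factor(demo_bucket, levels = bucket_levels, ordered = TRUE)

# Counts now print from Good to Severe, with NA last
counts <- table(bucket, useNA = "ifany")

# Ordered factors support comparisons such as the worst observed category
worst <- max(bucket, na.rm = TRUE)
```

With ordered levels, summaries and bar charts list the categories from best to worst without further sorting.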

City by Hour

# Preliminary exploration of city_hour data
# Convert the city and AQI_Bucket into factors
aqi_city_hour$City <-as_factor(aqi_city_hour$City)
aqi_city_hour$AQI_Bucket <-as_factor(aqi_city_hour$AQI_Bucket)
str(aqi_city_hour$City)
##  Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_city_hour$AQI_Bucket)
##  Factor w/ 6 levels "Poor","Moderate",..: NA NA NA NA NA NA NA NA NA NA ...
aqi_city_hour%>%nrow()
## [1] 707875
aqi_city_hour%>%summary()
##         City           Datetime                          PM2.5        
##  Ahmedabad: 48192   Min.   :2015-01-01 01:00:00.00   Min.   :   0.01  
##  Bengaluru: 48192   1st Qu.:2017-04-15 23:00:00.00   1st Qu.:  26.20  
##  Chennai  : 48192   Median :2018-08-04 20:00:00.00   Median :  46.42  
##  Delhi    : 48192   Mean   :2018-05-14 02:41:03.45   Mean   :  67.62  
##  Lucknow  : 48192   3rd Qu.:2019-09-02 14:00:00.00   3rd Qu.:  79.49  
##  Mumbai   : 48192   Max.   :2020-07-01 00:00:00.00   Max.   : 999.99  
##  (Other)  :418723                                    NA's   :145088   
##       PM10               NO              NO2              NOx        
##  Min.   :   0.01   Min.   :  0.01   Min.   :  0.01   Min.   :  0.00  
##  1st Qu.:  52.38   1st Qu.:  3.84   1st Qu.: 10.81   1st Qu.: 10.66  
##  Median :  91.50   Median :  7.96   Median : 20.32   Median : 20.79  
##  Mean   : 119.08   Mean   : 17.42   Mean   : 28.89   Mean   : 32.29  
##  3rd Qu.: 147.52   3rd Qu.: 16.15   3rd Qu.: 36.35   3rd Qu.: 37.15  
##  Max.   :1000.00   Max.   :499.99   Max.   :499.51   Max.   :498.61  
##  NA's   :296737    NA's   :116632   NA's   :117122   NA's   :123224  
##       NH3               CO              SO2               O3        
##  Min.   :  0.01   Min.   :  0.00   Min.   :  0.01   Min.   :  0.01  
##  1st Qu.:  8.12   1st Qu.:  0.42   1st Qu.:  4.88   1st Qu.: 13.42  
##  Median : 15.38   Median :  0.80   Median :  8.37   Median : 26.24  
##  Mean   : 23.61   Mean   :  2.18   Mean   : 14.04   Mean   : 34.80  
##  3rd Qu.: 29.23   3rd Qu.:  1.37   3rd Qu.: 14.78   3rd Qu.: 47.62  
##  Max.   :499.97   Max.   :498.57   Max.   :199.96   Max.   :497.62  
##  NA's   :272542   NA's   :86517    NA's   :130373   NA's   :129208  
##     Benzene          Toluene           Xylene            AQI        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0    Min.   :   8.0  
##  1st Qu.:  0.05   1st Qu.:  0.37   1st Qu.:  0.1    1st Qu.:  79.0  
##  Median :  0.86   Median :  2.59   Median :  0.8    Median : 116.0  
##  Mean   :  3.09   Mean   :  8.66   Mean   :  3.1    Mean   : 166.4  
##  3rd Qu.:  2.75   3rd Qu.:  8.41   3rd Qu.:  3.1    3rd Qu.: 208.0  
##  Max.   :498.07   Max.   :499.40   Max.   :500.0    Max.   :3133.0  
##  NA's   :163646   NA's   :220607   NA's   :455829   NA's   :129080  
##         AQI_Bucket    
##  Poor        : 66654  
##  Moderate    :198991  
##  Very Poor   : 57455  
##  Severe      : 27650  
##  Satisfactory:189434  
##  Good        : 38611  
##  NA's        :129080

Station By Day

#Preliminary exploration of stnt_day data
#Convert the stationId and AQI_Bucket into factors
aqi_stnt_day$StationId <-as_factor(aqi_stnt_day$StationId)
aqi_stnt_day$AQI_Bucket <-as_factor(aqi_stnt_day$AQI_Bucket)

str(aqi_stnt_day$StationId)
##  Factor w/ 110 levels "AP001","AP005",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_stnt_day$AQI_Bucket)
##  Factor w/ 6 levels "Moderate","Poor",..: NA 1 1 1 1 1 1 1 1 2 ...
aqi_stnt_day%>%nrow()
## [1] 108035
aqi_stnt_day%>%summary()
##    StationId          Date                PM2.5              PM10        
##  DL007  : 2009   Min.   :2015-01-01   Min.   :   0.02   Min.   :   0.01  
##  DL008  : 2009   1st Qu.:2017-10-14   1st Qu.:  31.88   1st Qu.:  70.15  
##  DL013  : 2009   Median :2018-12-02   Median :  55.95   Median : 122.09  
##  DL021  : 2009   Mean   :2018-08-17   Mean   :  80.27   Mean   : 157.97  
##  DL033  : 2009   3rd Qu.:2019-10-16   3rd Qu.:  99.92   3rd Qu.: 208.67  
##  GJ001  : 2009   Max.   :2020-07-01   Max.   :1000.00   Max.   :1000.00  
##  (Other):95981                        NA's   :21625     NA's   :42706    
##        NO              NO2              NOx              NH3        
##  Min.   :  0.01   Min.   :  0.01   Min.   :  0.00   Min.   :  0.01  
##  1st Qu.:  4.84   1st Qu.: 15.09   1st Qu.: 13.97   1st Qu.: 11.90  
##  Median : 10.29   Median : 27.21   Median : 26.66   Median : 23.59  
##  Mean   : 23.12   Mean   : 35.24   Mean   : 41.20   Mean   : 28.73  
##  3rd Qu.: 24.98   3rd Qu.: 46.93   3rd Qu.: 50.50   3rd Qu.: 38.14  
##  Max.   :470.00   Max.   :448.05   Max.   :467.63   Max.   :418.90  
##  NA's   :17106    NA's   :16547    NA's   :15500    NA's   :48105   
##        CO               SO2               O3            Benzene       
##  Min.   :  0.000   Min.   :  0.01   Min.   :  0.01   Min.   :  0.000  
##  1st Qu.:  0.530   1st Qu.:  5.04   1st Qu.: 18.89   1st Qu.:  0.160  
##  Median :  0.910   Median :  8.95   Median : 30.84   Median :  1.210  
##  Mean   :  1.606   Mean   : 12.26   Mean   : 38.13   Mean   :  3.358  
##  3rd Qu.:  1.450   3rd Qu.: 14.92   3rd Qu.: 47.14   3rd Qu.:  3.610  
##  Max.   :175.810   Max.   :195.65   Max.   :963.00   Max.   :455.030  
##  NA's   :12998     NA's   :25204    NA's   :25568    NA's   :31455    
##     Toluene           Xylene            AQI                AQI_Bucket   
##  Min.   :  0.00   Min.   :  0.00   Min.   :   8.0   Moderate    :29417  
##  1st Qu.:  0.69   1st Qu.:  0.00   1st Qu.:  86.0   Poor        :11493  
##  Median :  4.33   Median :  0.40   Median : 132.0   Very Poor   :11762  
##  Mean   : 15.35   Mean   :  2.42   Mean   : 179.7   Satisfactory:23636  
##  3rd Qu.: 17.51   3rd Qu.:  2.11   3rd Qu.: 254.0   Good        : 5510  
##  Max.   :454.85   Max.   :170.37   Max.   :2049.0   Severe      : 5207  
##  NA's   :38702    NA's   :85137    NA's   :21010    NA's        :21010

Station By Hour

#Preliminary exploration of stnt_hour data
#Convert the stationId and AQI_Bucket into factors
aqi_stnt_hour$StationId <-as_factor(aqi_stnt_hour$StationId)
aqi_stnt_hour$AQI_Bucket <-as_factor(aqi_stnt_hour$AQI_Bucket)

str(aqi_stnt_hour$StationId)
##  Factor w/ 110 levels "AP001","AP005",..: 1 1 1 1 1 1 1 1 1 1 ...
str(aqi_stnt_hour$AQI_Bucket)
##  Factor w/ 6 levels "Moderate","Poor",..: NA NA NA NA NA NA NA NA NA NA ...
aqi_stnt_hour%>%nrow()
## [1] 2589083
aqi_stnt_hour%>%summary()
##    StationId          Datetime                          PM2.5       
##  DL007  :  48192   Min.   :2015-01-01 01:00:00.00   Min.   :   0.0  
##  DL008  :  48192   1st Qu.:2017-10-13 20:00:00.00   1st Qu.:  28.2  
##  DL013  :  48192   Median :2018-12-02 06:00:00.00   Median :  52.6  
##  DL021  :  48192   Mean   :2018-08-17 09:52:35.77   Mean   :  80.9  
##  DL033  :  48192   3rd Qu.:2019-10-15 06:00:00.00   3rd Qu.:  97.7  
##  GJ001  :  48192   Max.   :2020-07-01 00:00:00.00   Max.   :1000.0  
##  (Other):2299931                                    NA's   :647689  
##       PM10               NO              NO2              NOx        
##  Min.   :   0.0    Min.   :  0.0    Min.   :  0.0    Min.   :  0.0   
##  1st Qu.:  64.0    1st Qu.:  3.0    1st Qu.: 13.1    1st Qu.: 11.3   
##  Median : 116.2    Median :  7.2    Median : 24.8    Median : 22.9   
##  Mean   : 158.5    Mean   : 22.8    Mean   : 35.2    Mean   : 40.6   
##  3rd Qu.: 204.0    3rd Qu.: 18.6    3rd Qu.: 45.5    3rd Qu.: 45.7   
##  Max.   :1000.0    Max.   :500.0    Max.   :500.0    Max.   :500.0   
##  NA's   :1119252   NA's   :553711   NA's   :528973   NA's   :490808  
##       NH3                CO              SO2               O3        
##  Min.   :  0.0     Min.   :  0.0    Min.   :  0.0    Min.   :  0.0   
##  1st Qu.: 11.2     1st Qu.:  0.4    1st Qu.:  4.2    1st Qu.: 11.0   
##  Median : 22.4     Median :  0.8    Median :  8.2    Median : 24.8   
##  Mean   : 28.7     Mean   :  1.5    Mean   : 12.1    Mean   : 38.1   
##  3rd Qu.: 37.8     3rd Qu.:  1.4    3rd Qu.: 14.5    3rd Qu.: 49.5   
##  Max.   :500.0     Max.   :498.6    Max.   :200.0    Max.   :997.0   
##  NA's   :1236618   NA's   :499302   NA's   :742737   NA's   :725973  
##     Benzene          Toluene            Xylene             AQI        
##  Min.   :  0.0    Min.   :  0.0     Min.   :  0.0     Min.   :   5.0  
##  1st Qu.:  0.1    1st Qu.:  0.3     1st Qu.:  0.0     1st Qu.:  84.0  
##  Median :  1.0    Median :  3.4     Median :  0.2     Median : 131.0  
##  Mean   :  3.3    Mean   : 14.9     Mean   :  2.4     Mean   : 180.2  
##  3rd Qu.:  3.2    3rd Qu.: 15.1     3rd Qu.:  1.8     3rd Qu.: 259.0  
##  Max.   :498.1    Max.   :500.0     Max.   :500.0     Max.   :3133.0  
##  NA's   :861579   NA's   :1042366   NA's   :2075104   NA's   :570190  
##         AQI_Bucket    
##  Moderate    :675008  
##  Poor        :239990  
##  Very Poor   :301150  
##  Satisfactory:530164  
##  Good        :152113  
##  Severe      :120468  
##  NA's        :570190

Discussion of Preliminary Analysis

Temperature & Precipitation Dataset

1. Structural Variations

  • Data from Rourkela and Bhubaneswar includes additional fields such as wind direction, wind speed, and snow.
  • This necessitates different handling methods for comparative analysis.

2. Data Range Inconsistencies

  • Most cities have data from 1990-2022, but Rourkela’s data covers only 2021, and Bhubaneswar’s data ends in September 2022.
  • Range filtering will be required for a uniform time-scale analysis.

3. Missing Elevation Data

  • Elevation data for Bhubaneswar and Rourkela is missing in the geolocations table.
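
The range filtering mentioned above can be sketched as follows. This is a minimal illustration in base R under stated assumptions: `city_a` and `city_b` are toy tables with made-up values, and `restrict_to_window` is a hypothetical helper, not part of the project code.

```r
# Toy stand-ins for two city tables (values are made up for illustration).
city_a <- data.frame(
  time = as.Date(c("2020-12-31", "2021-01-01", "2021-06-15", "2022-01-01")),
  tavg = c(21.0, 22.5, 27.1, 20.3)
)
city_b <- data.frame(
  time = as.Date(c("2021-01-01", "2021-06-15", "2021-12-31")),
  tavg = c(18.2, 30.4, 17.9)
)

# Common window: the latest start date and the earliest end date.
common_start <- max(min(city_a$time), min(city_b$time))
common_end   <- min(max(city_a$time), max(city_b$time))

# Hypothetical helper: keep only rows inside the common window.
restrict_to_window <- function(df, start, end) {
  df[df$time >= start & df$time <= end, , drop = FALSE]
}

city_a_common <- restrict_to_window(city_a, common_start, common_end)
city_b_common <- restrict_to_window(city_b, common_start, common_end)
```

Taking the overlap of all ranges keeps every city on the same time scale, at the cost of discarding years that only some cities cover.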

AQI Dataset

1. Factorization of AQI_Bucket

  • The AQI_Bucket has been categorized into six levels: Good, Satisfactory, Moderate, Poor, Very Poor, and Severe.
  • This categorization is consistent across city_day, city_hour, station_day, and station_hour datasets.

2. Measured Pollutants

  • The datasets report concentrations of 12 pollutants alongside the AQI: PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, Toluene, and Xylene. Only PM2.5 and PM10 are particulate matter; the rest are gases and volatile organic compounds.
  • Consistent measurement across all tables facilitates comprehensive air quality analysis.

3. Consistency in City and StationId Coverage

  • The same 26 cities are covered in city_day and city_hour datasets.
  • Station_day and station_hour datasets are consistent with 110 station IDs each.

4. Relation between AQI and AQI_Bucket

  • Wherever AQI is missing, AQI_Bucket is also NA (the two fields have identical NA counts in the summaries above), so no further cleaning is needed to keep them consistent.

5. Mismatch in Number of Stations

  • The stations table includes 230 stations, more than the 110 in station_day and station_hour datasets.
  • This indicates that station_day and station_hour datasets include data from a subset of stations.
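
Points 4 and 5 can be checked programmatically. The sketch below uses toy tables with made-up values; `station_day` and `stations` stand in for the real tables, and only the column names `StationId`, `AQI`, and `AQI_Bucket` are taken from the summaries above.

```r
# Toy stand-ins (made-up values) for the station_day and stations tables.
station_day <- data.frame(
  StationId  = c("AP001", "AP001", "DL007", "DL007"),
  AQI        = c(NA, 184, 312, NA),
  AQI_Bucket = c(NA, "Moderate", "Very Poor", NA)
)
stations <- data.frame(StationId = c("AP001", "AP002", "DL007"))

# TRUE when AQI and AQI_Bucket are missing on exactly the same rows.
na_aligned <- all(is.na(station_day$AQI) == is.na(station_day$AQI_Bucket))

# Stations that report measurements, and listed stations that report none.
measured   <- unique(station_day$StationId)
is_subset  <- all(measured %in% stations$StationId)
unmeasured <- setdiff(stations$StationId, measured)
```

Comparing the missingness row by row is stricter than comparing NA counts, since equal counts alone do not guarantee that the same rows are affected.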

Next Steps

  • Exploratory Data Analysis (EDA):
    • Further exploration of datasets to identify trends, patterns, and anomalies.
    • Examining temporal (daily vs. hourly) and spatial (city-wise and station-wise) AQI trends.
  • Correlation Analysis:
    • Investigating the correlation between city_day and station_day datasets.
    • Exploring the impact of environmental factors like temperature and precipitation on AQI.
  • Data Filtering and Normalization:
    • Applying filters and normalization techniques for structural differences and data range inconsistencies in the temperature and precipitation dataset.

The focus will be on uncovering insights into air quality trends and their temporal and spatial dynamics, along with the influence of environmental factors.

Detailed Analysis: Data Exploration and Processing

In this section, we focus on preprocessing and analyzing the Temperature & Precipitation dataset to discern climatic trends across various Indian cities. Key steps include merging geographic data (latitude, longitude, elevation) and consolidating data from individual cities into one comprehensive dataframe. We enhance the dataset with temporal features (month, year), geographical classifications (Coastal/Non-Coastal regions based on elevation), and seasonal categories (Summer, Winter, Rainy). Additionally, we identify day types (weekdays/weekends) and integrate city information, transforming the City field into a factor.

The processing extends to merging each city’s dataset with geolocation data, followed by combining these into a single merged_weather dataframe. We classify cities into Coastal or Non-Coastal regions and clean the data by removing rows with missing values in key columns. For a more granular analysis, we compute monthly averages of temperature and precipitation across different years.

Our analysis includes visualizations and trend examinations of annual precipitation and temperature across cities. We observe general trends like increasing precipitation since 2004 and rising temperatures, particularly in Delhi and Lucknow. The analysis also reveals distinct climatic differences between Coastal and Non-Coastal cities, with Coastal regions exhibiting higher temperatures and precipitation levels. This comprehensive exploration provides valuable insights into the geographical impact on climate patterns, highlighting significant variances in temperature and precipitation across different regions.
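
The region classification and monthly averaging described above can be sketched in base R. Everything here is illustrative: the `weather` table holds made-up values, and the 50 m cutoff in `coastal_cutoff_m` is a hypothetical threshold chosen for the example, not the one used in the project.

```r
# Toy daily records (made-up values); `elevation` is in metres.
weather <- data.frame(
  City      = c("Mumbai", "Mumbai", "Mumbai", "Delhi", "Delhi", "Delhi"),
  time      = as.Date(c("2019-01-05", "2019-01-20", "2019-02-03",
                        "2019-01-05", "2019-01-20", "2019-02-03")),
  tavg      = c(24, 26, 27, 12, 14, 18),
  prcp      = c(0, 0, 1, 2, 0, 4),
  elevation = c(14, 14, 14, 216, 216, 216)
)

# Hypothetical cutoff: cities below 50 m are labelled Coastal.
coastal_cutoff_m <- 50
weather$Region <- ifelse(weather$elevation < coastal_cutoff_m,
                         "Coastal", "Non-Coastal")

# Temporal features derived from the date.
weather$Year  <- as.integer(format(weather$time, "%Y"))
weather$Month <- as.integer(format(weather$time, "%m"))

# Monthly mean temperature and precipitation per city and year.
# Note: the formula interface drops rows with an NA in any listed column.
monthly_avg <- aggregate(cbind(tavg, prcp) ~ City + Region + Year + Month,
                         data = weather, FUN = mean)
```

Elevation is only a rough proxy for distance from the sea, so a classification like this is worth checking against the actual city list.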

Trend Analysis

Temperature and Precipitation Trends Over Time

library(ggplot2)

# Function to calculate annual trends
calculate_annual_trends <- function(df) {
  df %>% 
    group_by(Year = format(time, "%Y")) %>%
    summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
              Total_Prcp = sum(prcp, na.rm = TRUE))
}

# Annual trends for Bangalore, using the helper defined above
annual_trends_bangalore <- calculate_annual_trends(bangalore_df)

# Temperature Trend Plot for Bangalore
ggplot(annual_trends_bangalore, aes(x = Year, y = Mean_Tavg)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Mean Temperature Trend in Bangalore (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal()

# Precipitation Trend Plot for Bangalore
ggplot(annual_trends_bangalore, aes(x = Year, y = Total_Prcp)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Total Precipitation Trend in Bangalore (2015-2020)",
         x = "Year",
         y = "Total Precipitation (mm)") +
    theme_minimal()

# Annual trends for Chennai
annual_trends_chennai <- calculate_annual_trends(chennai_df)

# Temperature Trend Plot for Chennai
ggplot(annual_trends_chennai, aes(x = Year, y = Mean_Tavg)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Mean Temperature Trend in Chennai (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal()

# Precipitation Trend Plot for Chennai
ggplot(annual_trends_chennai, aes(x = Year, y = Total_Prcp)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Total Precipitation Trend in Chennai (2015-2020)",
         x = "Year",
         y = "Total Precipitation (mm)") +
    theme_minimal()

# Annual trends for Delhi
annual_trends_delhi <- calculate_annual_trends(delhi_df)

# Temperature Trend Plot for Delhi
ggplot(annual_trends_delhi, aes(x = Year, y = Mean_Tavg)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Mean Temperature Trend in Delhi (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal()

# Precipitation Trend Plot for Delhi
ggplot(annual_trends_delhi, aes(x = Year, y = Total_Prcp)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Total Precipitation Trend in Delhi (2015-2020)",
         x = "Year",
         y = "Total Precipitation (mm)") +
    theme_minimal()

# Annual trends for Lucknow
annual_trends_lucknow <- calculate_annual_trends(lucknow_df)

# Temperature Trend Plot for Lucknow
ggplot(annual_trends_lucknow, aes(x = Year, y = Mean_Tavg)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Mean Temperature Trend in Lucknow (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal()

# Precipitation Trend Plot for Lucknow
ggplot(annual_trends_lucknow, aes(x = Year, y = Total_Prcp)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Total Precipitation Trend in Lucknow (2015-2020)",
         x = "Year",
         y = "Total Precipitation (mm)") +
    theme_minimal()

# Annual trends for Mumbai
annual_trends_mumbai <- calculate_annual_trends(mumbai_df)

# Temperature Trend Plot for Mumbai
ggplot(annual_trends_mumbai, aes(x = Year, y = Mean_Tavg)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Mean Temperature Trend in Mumbai (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal()

# Precipitation Trend Plot for Mumbai
ggplot(annual_trends_mumbai, aes(x = Year, y = Total_Prcp)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Total Precipitation Trend in Mumbai (2015-2020)",
         x = "Year",
         y = "Total Precipitation (mm)") +
    theme_minimal()

# Annual trends for Rajasthan
annual_trends_rajasthan <- calculate_annual_trends(rajasthan_df)

# Temperature Trend Plot for Rajasthan
ggplot(annual_trends_rajasthan, aes(x = Year, y = Mean_Tavg)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Mean Temperature Trend in Rajasthan (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal()

# Precipitation Trend Plot for Rajasthan
ggplot(annual_trends_rajasthan, aes(x = Year, y = Total_Prcp)) +
    geom_point() +
    geom_line(aes(group = 1)) +  # Year is a character column, so group all points into one line
    labs(title = "Annual Total Precipitation Trend in Rajasthan (2015-2020)",
         x = "Year",
         y = "Total Precipitation (mm)") +
    theme_minimal()

Comparative Analysis

# Select and rename columns (if needed) for each city
bangalore_df <- bangalore_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Bangalore")
chennai_df <- chennai_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Chennai")
delhi_df <- delhi_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Delhi")
lucknow_df <- lucknow_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Lucknow")
mumbai_df <- mumbai_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Mumbai")
rajasthan_df <- rajasthan_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Rajasthan")
bhubaneswar_df <- bhubaneswar_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Bhubaneswar")
rourkela_df <- rourkela_df %>% select(time, tavg, tmin, tmax, prcp) %>% mutate(City = "Rourkela")

# Combine all city datasets into one dataframe
all_cities_df <- rbind(bangalore_df, chennai_df, delhi_df, lucknow_df, mumbai_df, rajasthan_df, bhubaneswar_df, rourkela_df)

# Boxplot for average temperatures
ggplot(all_cities_df, aes(x = City, y = tavg, fill = City)) +
  geom_boxplot() +
  labs(title = "Comparison of Average Temperatures Across Cities",
       x = "City",
       y = "Average Temperature (°C)") +
  theme_minimal() +
  theme(legend.position = "none")
## Warning: Removed 78 rows containing non-finite values (`stat_boxplot()`).

# Boxplot for minimum temperatures
ggplot(all_cities_df, aes(x = City, y = tmin, fill = City)) +
  geom_boxplot() +
  labs(title = "Comparison of Minimum Temperatures Across Cities",
       x = "City",
       y = "Minimum Temperature (°C)") +
  theme_minimal() +
  theme(legend.position = "none")
## Warning: Removed 2090 rows containing non-finite values (`stat_boxplot()`).

Maximum Temperature Comparison

# Boxplot for maximum temperatures
ggplot(all_cities_df, aes(x = City, y = tmax, fill = City)) +
  geom_boxplot() +
  labs(title = "Comparison of Maximum Temperatures Across Cities",
       x = "City",
       y = "Maximum Temperature (°C)") +
  theme_minimal() +
  theme(legend.position = "none")
## Warning: Removed 891 rows containing non-finite values (`stat_boxplot()`).

Comparative Analysis: Precipitation Data Among Cities

# Creating boxplots for precipitation data
ggplot(all_cities_df, aes(x = City, y = prcp, fill = City)) +
    geom_boxplot() +
    labs(title = "Comparison of Precipitation Among Cities (2015-2020)",
         x = "City",
         y = "Precipitation (mm)") +
    theme_minimal() +
    theme(legend.position = "none")
## Warning: Removed 5097 rows containing non-finite values (`stat_boxplot()`).

Year-Over-Year Climate Trends Across Cities

# Calculate annual mean temperatures for each city
annual_mean_temps <- all_cities_df %>%
    group_by(City, Year = format(time, "%Y")) %>%
    summarise(Mean_Tavg = mean(tavg, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Plotting the temperature trends
ggplot(annual_mean_temps, aes(x = Year, y = Mean_Tavg, group = City, color = City)) +
    geom_line() +
    labs(title = "Year-Over-Year Mean Temperature Trends Across Cities (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal() +
    theme(legend.position = "bottom")

# Plotting temperature trends with improved axis readability
ggplot(annual_mean_temps, aes(x = Year, y = Mean_Tavg, group = City, color = City)) +
    geom_line() +
    labs(title = "Year-Over-Year Mean Temperature Trends Across Cities (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)") +
    theme_minimal() +
    theme(legend.position = "bottom") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Adjusting x-axis labels

# Calculate total annual precipitation for each city
annual_precipitation <- all_cities_df %>%
    group_by(City, Year = format(time, "%Y")) %>%
    summarise(Total_Prcp = sum(prcp, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Base plot with line plot for temperatures
p <- ggplot() +
    geom_line(data = annual_mean_temps, aes(x = Year, y = Mean_Tavg, group = City, color = City)) +
    labs(title = "Annual Mean Temperature and Total Precipitation Trends (2015-2020)",
         x = "Year",
         y = "Mean Temperature (°C)")

# Adding the bar plot for precipitation.
# Bars are divided by 10 so they fit on the temperature axis;
# the secondary axis multiplies by 10 to show the original values in mm.
p + geom_bar(data = annual_precipitation, aes(x = Year, y = Total_Prcp / 10, fill = City), stat = "identity", position = "dodge", alpha = 0.5) +
    scale_y_continuous(sec.axis = sec_axis(~ . * 10, name = "Total Precipitation (mm)"))

Seasonal Variation Analysis

Analyzing Seasonal Changes in Temperature and Precipitation

library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)

# Function to assign seasons to months
get_season <- function(month) {
  case_when(
    month %in% c(3, 4, 5) ~ "Spring",
    month %in% c(6, 7, 8) ~ "Summer",
    month %in% c(9, 10, 11) ~ "Autumn",
    month %in% c(12, 1, 2) ~ "Winter"
  )
}

# Adding a Season column to the dataset
all_cities_df <- all_cities_df %>%
  mutate(Month = month(time),
         Season = get_season(Month))

# Calculating seasonal mean temperature and total precipitation
seasonal_stats <- all_cities_df %>%
  group_by(City, Season) %>%
  summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
            Total_Prcp = sum(prcp, na.rm = TRUE), .groups = 'drop')

# Plotting Seasonal Temperature Variations
ggplot(seasonal_stats, aes(x = Season, y = Mean_Tavg, fill = City)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Seasonal Mean Temperature Variations Across Cities",
       x = "Season",
       y = "Mean Temperature (°C)") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Plotting Seasonal Precipitation Variations
ggplot(seasonal_stats, aes(x = Season, y = Total_Prcp, fill = City)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Seasonal Total Precipitation Variations Across Cities",
       x = "Season",
       y = "Total Precipitation (mm)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Correlation Analysis Between Temperature and Precipitation

Investigating the Relationship Between Average Temperature and Total Precipitation

library(dplyr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(DT)

# Calculating annual mean temperature and total precipitation for each city
annual_climate_data <- all_cities_df %>%
  group_by(City, Year = format(time, "%Y")) %>%
  summarise(Mean_Tavg = mean(tavg, na.rm = TRUE),
            Total_Prcp = sum(prcp, na.rm = TRUE), .groups = 'drop')

# Enhanced Correlation Plot
ggpairs(annual_climate_data, columns = c("Mean_Tavg", "Total_Prcp"), ggplot2::aes(colour = City)) +
  labs(title = "Enhanced Correlation Matrix Between Mean Temperature and Total Precipitation Across Cities")

# Interactive Table of Correlations
annual_climate_data %>%
  group_by(City) %>%
  summarise(Correlation = cor(Mean_Tavg, Total_Prcp, use = "complete.obs")) %>%
  datatable(options = list(pageLength = 10))

Extreme Weather Events Analysis

Setting Thresholds

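First, we define what constitutes an “extreme” event. The definition can vary with the city and the type of event (temperature, rainfall, etc.).

Here, a day counts as extremely hot when its average temperature exceeds the 95th percentile, and as a heavy-rainfall day when its precipitation exceeds the 95th percentile. Both percentiles are computed over the pooled data from all cities, so cities with warmer or wetter climates tend to register more extreme days than the others.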

# Define thresholds for extreme events
temperature_threshold <- quantile(all_cities_df$tavg, 0.95, na.rm = TRUE)
rainfall_threshold <- quantile(all_cities_df$prcp, 0.95, na.rm = TRUE)

# Identify extreme temperature events
all_cities_df$extreme_temp <- all_cities_df$tavg > temperature_threshold

# Identify extreme rainfall events
all_cities_df$extreme_rain <- all_cities_df$prcp > rainfall_threshold


# Analyze extreme temperature events
extreme_temp_analysis <- all_cities_df %>%
    group_by(Year = format(time, "%Y"), City) %>%
    summarise(Extreme_Temp_Days = sum(extreme_temp, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Analyze extreme rainfall events
extreme_rain_analysis <- all_cities_df %>%
    group_by(Year = format(time, "%Y"), City) %>%
    summarise(Extreme_Rain_Days = sum(extreme_rain, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
# Plotting extreme temperature trends
ggplot(extreme_temp_analysis, aes(x = Year, y = Extreme_Temp_Days, group = City, color = City)) +
    geom_line() +
    labs(title = "Yearly Trends of Extreme Temperature Days",
         x = "Year",
         y = "Number of Extreme Temperature Days")

# Plotting extreme rainfall trends
ggplot(extreme_rain_analysis, aes(x = Year, y = Extreme_Rain_Days, group = City, color = City)) +
    geom_line() +
    labs(title = "Yearly Trends of Extreme Rainfall Days",
         x = "Year",
         y = "Number of Extreme Rainfall Days")
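
Because the code above uses a single pooled threshold, a cooler city may never exceed it. One alternative is a per-city percentile. The sketch below is a hedged illustration in base R with made-up temperatures; the `daily` table and the column names `city_threshold`, `extreme_temp_city`, and `extreme_temp_pooled` are assumptions introduced for the example.

```r
# Toy daily temperatures (made-up values) for two cities.
daily <- data.frame(
  City = rep(c("Delhi", "Bangalore"), each = 5),
  tavg = c(30, 32, 34, 36, 44,   22, 23, 24, 25, 29)
)

# 95th percentile computed separately within each city.
daily$city_threshold <- ave(daily$tavg, daily$City,
                            FUN = function(x) quantile(x, 0.95, na.rm = TRUE))
daily$extreme_temp_city <- daily$tavg > daily$city_threshold

# Pooled threshold over all cities, for comparison.
pooled_threshold <- quantile(daily$tavg, 0.95, na.rm = TRUE)
daily$extreme_temp_pooled <- daily$tavg > pooled_threshold

# Number of extreme days per city under each definition.
city_counts   <- tapply(daily$extreme_temp_city, daily$City, sum)
pooled_counts <- tapply(daily$extreme_temp_pooled, daily$City, sum)
```

In this toy data the pooled threshold flags only the hottest Delhi day, while the per-city threshold flags one day in each city. Which definition is appropriate depends on whether "extreme" is meant relative to the country or relative to local conditions.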

Impact of Extreme Weather Events on Specific Months

library(dplyr)
library(lubridate)

# Extract the month from the date
all_cities_df$Month <- format(as.Date(all_cities_df$time), "%m")
all_cities_df$Month <- as.integer(all_cities_df$Month)

# Monthly analysis of extreme temperature events
monthly_extreme_temp <- all_cities_df %>%
    group_by(Month, City) %>%
    summarise(Extreme_Temp_Days = sum(extreme_temp, na.rm = TRUE))
## `summarise()` has grouped output by 'Month'. You can override using the
## `.groups` argument.
# Monthly analysis of extreme rainfall events
monthly_extreme_rain <- all_cities_df %>%
    group_by(Month, City) %>%
    summarise(Extreme_Rain_Days = sum(extreme_rain, na.rm = TRUE))
## `summarise()` has grouped output by 'Month'. You can override using the
## `.groups` argument.
# Plotting monthly extreme temperature trends
ggplot(monthly_extreme_temp, aes(x = Month, y = Extreme_Temp_Days, fill = City)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = "Monthly Distribution of Extreme Temperature Days",
         x = "Month",
         y = "Number of Extreme Temperature Days") +
    scale_x_continuous(breaks = 1:12, labels = month.abb)

# Plotting monthly extreme rainfall trends
ggplot(monthly_extreme_rain, aes(x = Month, y = Extreme_Rain_Days, fill = City)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = "Monthly Distribution of Extreme Rainfall Days",
         x = "Month",
         y = "Number of Extreme Rainfall Days") +
    scale_x_continuous(breaks = 1:12, labels = month.abb)

Long-Term Climate Change Trends Analysis

# Prepare the data for long-term trend analysis
# Ensure all datasets have a Year column

# Add Year column to each city's dataset
bangalore_df$Year <- format(bangalore_df$time, "%Y")
chennai_df$Year <- format(chennai_df$time, "%Y")
delhi_df$Year <- format(delhi_df$time, "%Y")
lucknow_df$Year <- format(lucknow_df$time, "%Y")
mumbai_df$Year <- format(mumbai_df$time, "%Y")
rajasthan_df$Year <- format(rajasthan_df$time, "%Y")
bhubaneswar_df$Year <- format(bhubaneswar_df$time, "%Y")
rourkela_df$Year <- format(rourkela_df$time, "%Y")

# Combine all datasets into one dataframe
all_cities_df <- rbind(
    bangalore_df %>% mutate(City = "Bangalore"),
    chennai_df %>% mutate(City = "Chennai"),
    delhi_df %>% mutate(City = "Delhi"),
    lucknow_df %>% mutate(City = "Lucknow"),
    mumbai_df %>% mutate(City = "Mumbai"),
    rajasthan_df %>% mutate(City = "Rajasthan"),
    bhubaneswar_df %>% mutate(City = "Bhubaneswar"),
    rourkela_df %>% mutate(City = "Rourkela")
)
all_cities_long_term <- all_cities_df %>%
    mutate(Year = as.numeric(Year),
           Decade = case_when(
               Year >= 1990 & Year < 2000 ~ "1990s",
               Year >= 2000 & Year < 2010 ~ "2000s",
               # This bin spans 2010-2022 (13 years), so it is labelled by its actual range
               Year >= 2010 & Year <= 2022 ~ "2010-2022"
           ))

# Decadal temperature trends
decadal_temp_trends <- all_cities_long_term %>%
    group_by(City, Decade) %>%
    summarise(Mean_Tavg = mean(tavg, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Decadal precipitation trends
decadal_precip_trends <- all_cities_long_term %>%
    group_by(City, Decade) %>%
    summarise(Total_Prcp = sum(prcp, na.rm = TRUE))
## `summarise()` has grouped output by 'City'. You can override using the
## `.groups` argument.
# Temperature trend plot
ggplot(decadal_temp_trends, aes(x = Decade, y = Mean_Tavg, fill = City)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = "Decadal Average Temperature Trends",
         x = "Decade",
         y = "Mean Temperature (°C)")

# Precipitation trend plot
ggplot(decadal_precip_trends, aes(x = Decade, y = Total_Prcp, fill = City)) +
    geom_bar(stat = "identity", position = position_dodge()) +
    labs(title = "Decadal Total Precipitation Trends",
         x = "Decade",
         y = "Total Precipitation (mm)")

Concise Summary: Analysis and Interpretation of Climate Trends

In this phase, we analyze temperature and precipitation trends to identify long-term climatic changes. We scrutinize average temperatures across decades for any upward or downward trends, with an increasing trend possibly indicating global warming. Similarly, we examine precipitation patterns over the years to detect any shifts in rainfall. This analysis spans different cities, accounting for their unique geographic and climatic characteristics. It’s important to note, however, that these trends suggest potential changes but don’t confirm causation, as climate change is driven by various complex factors. Through this method, we aim to capture an overarching view of how climate parameters have shifted over the past three decades, shedding light on broader trends in climate change.
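
One simple way to quantify an upward or downward trend is the slope of a least-squares line through the annual means. The following is a minimal sketch with made-up values, not results from the project data; the `annual` table and variable names are illustrative.

```r
# Toy annual mean temperatures (made-up values) for one city.
annual <- data.frame(
  Year      = 2015:2020,
  Mean_Tavg = c(24.1, 24.3, 24.2, 24.6, 24.7, 24.9)
)

# Least-squares slope: estimated change in mean temperature per year.
trend_fit      <- lm(Mean_Tavg ~ Year, data = annual)
slope_per_year <- unname(coef(trend_fit)["Year"])

# p-value for the slope, as reported by summary.lm().
slope_p_value <- summary(trend_fit)$coefficients["Year", "Pr(>|t|)"]
```

A positive slope describes the direction of change in the sample only. With few years of data the estimate is uncertain, and it says nothing about the cause of the change.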

Feature Engineering for Rainfall Prediction - Exploration

# Example using Bangalore's data

# Creating new features based on the date
bangalore_df <- bangalore_df %>%
                mutate(Month = format(time, "%m"),
                       Day = format(time, "%d"),
                       DayOfYear = yday(time))

# Removing the original 'time' column
bangalore_df <- select(bangalore_df, -time)


# Splitting the data into training and testing sets
# Assuming 80% training, 20% testing split

set.seed(123) # For reproducibility
training_indices <- sample(1:nrow(bangalore_df), 0.8 * nrow(bangalore_df))

train_data <- bangalore_df[training_indices, ]
test_data <- bangalore_df[-training_indices, ]

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# Using Random Forest for rainfall prediction
rf_model <- randomForest(prcp ~ ., data = train_data)

# Summarizing the model
print(rf_model)
## 
## Call:
##  randomForest(formula = prcp ~ ., data = train_data) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 80.49238
##                     % Var explained: 9.4
# Making predictions on the test data
predictions <- predict(rf_model, test_data)

# Using Mean Absolute Error (MAE) for evaluation
mae <- mean(abs(predictions - test_data$prcp))
print(paste("Mean Absolute Error: ", mae))
## [1] "Mean Absolute Error:  3.79870041260058"

Interpretation of the Results:

Mean of Squared Residuals: This is about 80.49. For a random forest it is the out-of-bag (OOB) mean squared error, i.e. the average squared difference between each observed value and the prediction from the trees that did not see that observation during training.

Percentage of Variance Explained:

The model explains around 9.4% of the variance in the rainfall data. This is low, suggesting that the available features capture little of the variability in rainfall.

Mean Absolute Error (MAE):

An MAE of 3.80 means that, on average, the model’s test-set predictions are about 3.80 units away from the actual values (presumably millimeters, if the rainfall is measured that way). For scale, the mean of prcp in the training data is about 6.04 (the root-node value of the decision tree fitted in the next section), so the typical error is large relative to typical rainfall.
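The relationship between the two random forest figures can be checked by hand. The sketch below assumes that randomForest reports the percentage of variance explained as 1 - MSE / Var(prcp), with the variance computed using n in the denominator. It also assumes that the root-node deviance printed for the decision tree in the next section (155748.9 over n = 1753) is the total sum of squares of prcp on the same training set. All numbers are copied from the printouts in this report.

```r
# Values copied from the model printouts in this report
oob_mse       <- 80.49238   # "Mean of squared residuals" from print(rf_model)
root_deviance <- 155748.9   # root-node deviance from print(dt_model)
n_train       <- 1753       # number of training rows from print(dt_model)

# Variance of prcp in the training data (n in the denominator)
prcp_variance <- root_deviance / n_train

# Share of that variance removed by the model, as a percentage
pct_var_explained <- 100 * (1 - oob_mse / prcp_variance)

# The OOB error on the original scale of the rainfall measurements
oob_rmse <- sqrt(oob_mse)

cat(sprintf("Variance of prcp: %.2f\n", prcp_variance))
cat(sprintf("Percent variance explained: %.1f\n", pct_var_explained))
cat(sprintf("OOB RMSE: %.2f\n", oob_rmse))
```

Under these assumptions the calculation reproduces the 9.4% figure, which shows that the OOB error is only slightly smaller than the variance of the rainfall itself.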

Predictive Modeling with Decision Trees

Building a Decision Tree Model for Rainfall Prediction

library(rpart)

# Building the decision tree model
dt_model <- rpart(prcp ~ ., data = train_data, method = "anova")

# Printing the model
print(dt_model)
## n= 1753 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 1753 155748.900  6.044540  
##     2) tmin>=20.75 580  18999.970  4.175578 *
##     3) tmin< 20.75 1173 133721.200  6.968665  
##       6) Month=01,02,03,06,07,08,09,10,11,12 1125 110491.400  6.424355  
##        12) Month=01,02,03,06,07,08,11,12 902  53517.360  5.626395  
##          24) tavg>=19.65 895  48763.280  5.503594  
##            48) Day=01,02,03,04,05,06,07,08,09,10,11,13,14,18,19,20,22,23,27,28,29,30,31 654  11950.970  4.783881 *
##            49) Day=12,15,16,17,21,24,25,26 241  35554.250  7.456676  
##              98) Month=01,02,03,06,07,11,12 208  15600.370  6.475991 *
##              99) Month=08 33  18492.960 13.637960  
##               198) Year=2016,2018,2019,2020 21   1633.101  5.788623 *
##               199) Year=2015,2017 12  13301.760 27.374300 *
##          25) tavg< 19.65 7   3014.946 21.327370 *
##        13) Month=09,10 223  54076.570  9.651976  
##          26) Year=2015,2016,2018,2019,2020 184  24055.390  7.484868  
##            52) Day=01,02,03,04,05,06,07,10,12,13,14,15,16,17,19,20,22,23,26,27,29,31 133   5506.941  5.388758 *
##            53) Day=08,09,11,18,21,24,25,28,30 51  16440.170 12.951190  
##             106) tavg>=23.25 34   3662.822  6.785612 *
##             107) tavg< 23.25 17   8899.885 25.282350 *
##          27) Year=2017 39  25080.130 19.876280  
##            54) Day=05,08,11,12,14,16,17,18,19,20,21,22,23,24,26,28,29,30,31 25   1587.206  6.438995 *
##            55) Day=01,02,04,06,09,10,13,15,25,27 14  10918.170 43.871430 *
##       7) Month=04,05 48  15084.630 19.725920  
##        14) Day=01,02,03,08,09,10,11,12,13,14,15,16,18,26,29,30,31 30   2792.180 10.878140 *
##        15) Day=04,17,19,20,21,23,24,27,28 18   6029.796 34.472220 *
# Making predictions on the test data
predictions_dt <- predict(dt_model, test_data)

# Using Mean Absolute Error (MAE) for evaluation
mae_dt <- mean(abs(predictions_dt - test_data$prcp))
print(paste("Mean Absolute Error: ", mae_dt))
## [1] "Mean Absolute Error:  4.27886547420482"
# Additional evaluation metrics - Root Mean Square Error (RMSE)
rmse_dt <- sqrt(mean((predictions_dt - test_data$prcp)^2))
print(paste("Root Mean Square Error: ", rmse_dt))
## [1] "Root Mean Square Error:  7.6573603613743"
library(rpart.plot)

# Plotting the decision tree
rpart.plot(dt_model, main = "Decision Tree for Rainfall Prediction")

# Enhanced plotting of the decision tree
rpart.plot(dt_model, 
           main = "Decision Tree for Rainfall Prediction", 
           type = 4,     # Label all nodes and draw separate split labels for the left and right branches
           extra = 101,  # Show the number and percentage of observations in each node
           under = TRUE, # Place the extra text under the node box instead of inside it
           faclen = 0,   # Full factor levels in split labels
           cex = 0.6,    # Size of text
           tweak = 1.2)  # Additional multiplier applied to the text size
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
## Warning: cex and tweak both specified, applying both
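The MAE and RMSE reported above are easier to judge against a naive baseline that always predicts the training mean. The following is a minimal sketch with synthetic, skewed "rainfall" values. The helper `regression_metrics` and the vectors `train_prcp` and `test_prcp` are hypothetical names introduced for illustration; they are not part of the project code.

```r
# Hypothetical helper: MAE and RMSE for a vector of predictions
regression_metrics <- function(actual, predicted) {
  c(MAE  = mean(abs(predicted - actual)),
    RMSE = sqrt(mean((predicted - actual)^2)))
}

# Synthetic right-skewed rainfall values (illustrative only, not project data)
set.seed(1)
train_prcp <- rexp(200, rate = 1 / 6)
test_prcp  <- rexp(50, rate = 1 / 6)

# Naive baseline: predict the training mean for every test day
baseline_pred <- rep(mean(train_prcp), length(test_prcp))
baseline <- regression_metrics(test_prcp, baseline_pred)

print(round(baseline, 2))
```

A model is only useful for prediction if its test MAE and RMSE are clearly lower than those of this baseline. Applying the same helper to `predictions_dt` and `test_data$prcp` would give the corresponding comparison for the decision tree.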

Geospatial Analysis:

library(leaflet)

# Ensure the column names are correctly referenced
station_geo_df$Latitude <- as.numeric(station_geo_df$Latitude)
station_geo_df$longitude <- as.numeric(station_geo_df$longitude)

# Create a leaflet map
leaflet(station_geo_df) %>% 
  addTiles() %>% 
  addMarkers(~longitude, ~Latitude, popup = ~Location_Name)
# Creating a color palette
pal <- colorNumeric(palette = "viridis", domain = station_geo_df$Elevation)

leaflet(station_geo_df) %>% 
  addTiles() %>% 
  addCircleMarkers(~longitude, ~Latitude, 
                   popup = ~paste(Location_Name, "Elevation:", Elevation, "m"),
                   color = ~pal(Elevation), fill = TRUE)
library(leaflet)
library(leaflet.extras)

# Calculating average rainfall for each city
average_rainfall <- all_cities_df %>%
  group_by(City) %>%
  summarise(Avg_Rainfall = mean(prcp, na.rm = TRUE))

# Merging with station_geo_df to include geographic coordinates
station_geo_df <- merge(station_geo_df, average_rainfall, by.x = "Location_Name", by.y = "City")

average_rainfall 
## # A tibble: 7 × 2
##   City        Avg_Rainfall
##   <chr>              <dbl>
## 1 Bangalore           5.93
## 2 Bhubaneswar         7.07
## 3 Chennai            11.5 
## 4 Delhi               6.18
## 5 Lucknow             8.70
## 6 Mumbai             23.0 
## 7 Rajasthan           5.93
leaflet(station_geo_df) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addHeatmap(lng = ~longitude, lat = ~Latitude, intensity = ~Avg_Rainfall, radius = 20, blur = 15)

Integrating Air Quality and Weather Data with Geospatial Insights

This section focuses on the seamless integration of air quality and weather data, enriched with geospatial coordinates and detailed city profiles. By blending these diverse datasets, we create a comprehensive perspective that not only assesses environmental parameters but also considers the geographical context of Indian cities. This fusion enables us to understand the intricate relationship between air quality, weather patterns, and their geographical variations across India.
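Merging by city name silently drops rows whose names differ between datasets. In this report the AQI plots filter on "Bengaluru" while the rainfall table lists "Bangalore", so a key check before merging is worthwhile. The sketch below uses small synthetic data frames; `aqi_toy`, `weather_toy`, and `city_aliases` are hypothetical names and the values are made up.

```r
# Synthetic data frames with one mismatched city name (illustrative values only)
aqi_toy <- data.frame(City = c("Bengaluru", "Delhi", "Mumbai"),
                      avg_AQI = c(95, 260, 110),
                      stringsAsFactors = FALSE)
weather_toy <- data.frame(City = c("Bangalore", "Delhi", "Mumbai"),
                          tavg = c(24, 25, 27),
                          stringsAsFactors = FALSE)

# Cities present in the AQI data but absent from the weather data
unmatched <- setdiff(aqi_toy$City, weather_toy$City)
print(unmatched)

# merge() performs an inner join by default, so the unmatched city is dropped
inner <- merge(aqi_toy, weather_toy, by = "City")
print(nrow(inner))

# Harmonise names with a lookup of known aliases, then merge again
city_aliases <- c(Bangalore = "Bengaluru")
hit <- weather_toy$City %in% names(city_aliases)
weather_toy$City[hit] <- unname(city_aliases[weather_toy$City[hit]])
fixed <- merge(aqi_toy, weather_toy, by = "City")
print(nrow(fixed))
```

Running the same `setdiff()` check on the real data frames before each `merge()` would show whether any cities are lost in the integration step.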

#Convert the Date to date type
aqi_city_day$Date = as_date(aqi_city_day$Date, format='%Y-%m-%d')


#Extract month and year as additional columns
merged <- aqi_city_day
merged <- merged %>% mutate(Month = month(Date))
merged <- merged %>% mutate(Year = year(Date))
merged <- merged %>% mutate(Day = wday(Date, label=TRUE, abbr=FALSE))


#Convert City in the Indian cities database to a factor
indian_cities$City = as_factor(indian_cities$City)

#Merge Lat-Long into aqi_day
merged <- merge(merged, indian_cities%>%select("City", "Lat", "Long"), by="City")

#Introduce new column for partitioning into North/South, where North is Lat > 22.5
merged <-merged %>% mutate(Region = ifelse(Lat>22.5,"North","South"))

#Introduce column for season. Summer:03-06, Rainy:07-10, Winter:11-02
merged <- merged %>% mutate(Season = case_when(
  Month %in% c(3,4,5,6)   ~ "Summer",
  Month %in% c(7,8,9,10)  ~ "Rainy",
  Month %in% c(11,12,1,2) ~ "Winter"))

#Introduce column for partitioning into weekday and weekend. Weekend = Saturday, Sunday; Weekday= others
merged <- merged %>% mutate(DayType = ifelse(Day %in% c("Sunday", "Saturday"), "Weekend", "Weekday"))


# .groups = "drop" returns ungrouped summaries and silences the summarise() grouping message
yearly_summary <- merged %>% group_by(Year, City) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE), .groups="drop")
yearly_summary <- merge(yearly_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")

seasonal_summary <- merged %>% group_by(Year, City, Season) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE), .groups="drop")
seasonal_summary <- merge(seasonal_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")

regional_summary <- merged %>% group_by(Year, City, Region) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE), .groups="drop")
regional_summary <- merge(regional_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")

regional_seasonal_summary <- merged %>% group_by(Year, Region, Season, City) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE), .groups="drop")
regional_seasonal_summary <- merge(regional_seasonal_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")

weektype_summary <- merged %>% group_by(Year, Month, City, DayType) %>% summarise(avg_AQI=mean(AQI, na.rm=TRUE), .groups="drop")
weektype_summary <- merge(weektype_summary, indian_cities%>%select("City", "Lat", "Long"), by="City")

ggplot(yearly_summary, aes(x=Year, y=avg_AQI))+geom_line()+facet_wrap(~City, ncol=6)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Distribution Analysis of AQI Across Cities Over the Years

m_2015<-yearly_summary%>%filter(Year==2015)
m_2016<-yearly_summary%>%filter(Year==2016)
m_2017<-yearly_summary%>%filter(Year==2017)
m_2018<-yearly_summary%>%filter(Year==2018)
m_2019<-yearly_summary%>%filter(Year==2019)
m_2020<-yearly_summary%>%filter(Year==2020)

yearly_summary %>% 
        leaflet()%>%
        addProviderTiles("CartoDB")%>%
        addCircleMarkers(data=m_2015,radius=~avg_AQI/10,popup=~City, group="2015")%>%
        addCircleMarkers(data=m_2016,radius=~avg_AQI/10,popup=~City, group="2016")%>%
        addCircleMarkers(data=m_2017,radius=~avg_AQI/10,popup=~City, group="2017")%>%
        addCircleMarkers(data=m_2018,radius=~avg_AQI/10,popup=~City, group="2018")%>%
        addCircleMarkers(data=m_2019,radius=~avg_AQI/10,popup=~City, group="2019")%>%
        addCircleMarkers(data=m_2020,radius=~avg_AQI/10,popup=~City, group="2020")%>%
        addLayersControl(overlayGroups=c("2015","2016","2017","2018","2019","2020"), options=layersControlOptions(collapsed=FALSE))-> map
# Hide the later years by their (character) group names so that only 2015 is shown initially
map <- map %>% hideGroup(c("2016","2017","2018","2019","2020"))

map        

Trend in AQI across cities annually for different seasons.


ggplot(data = seasonal_summary,aes(x=Year, y=avg_AQI, color=Season))+geom_point()+facet_wrap(~City, ncol=3)
## Warning: Removed 15 rows containing missing values (`geom_point()`).

Capture the seasonal variation for only five cities

# Capture the seasonal variation for five selected cities
filtered <- seasonal_summary%>%
            filter(City %in% c("Ahmedabad", "Delhi", "Patna", "Mumbai","Bengaluru"))
ggplot(data = filtered,aes(x=Year, y=avg_AQI, color=Season))+
      geom_point()+
      facet_wrap(~City, ncol=1, scales="free_y")+
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## Warning: Removed 11 rows containing missing values (`geom_point()`).

ggplot(data=t1, aes(x=State, fill=City))+geom_bar()+ 
              theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), legend.position = "none")

summary_merged_weather <-merge(summary_merged_weather, geolocations, by.x = "City", by.y="Location_Name")

#Create a Region column where Region = Coastal if Elevation <= 100 m, else Noncoastal
summary_merged_weather <- summary_merged_weather %>% mutate(Region = ifelse(Elevation > 100, "Noncoastal", "Coastal"))


#Plot the annual variation in average temperature for coastal and non-coastal cities, 1990-1999
summary_merged_weather %>% filter(Year %in% seq(1990,1999))%>%
                            group_by(Year, Region, City) %>%
                            ggplot(aes(x=City, y=t_avg_annual, color=Region)) +
                            geom_point() +
                            facet_wrap(~Year, ncol=5)+ 
              theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Mean of Pollutant, Annually.

gathered_df%>%filter(City %in% c("Ahmedabad", "Bengaluru", "Mumbai", "Delhi", "Patna")) %>%
                     filter(!Pollutant %in% c("PM2.5","PM10")) %>%
                     group_by(City,Year)%>%
                     ggplot(aes(x=Pollutant, y=Avg, color=Pollutant))+
                     geom_col(aes(fill=Pollutant))+
                     facet_grid(vars(City),vars(Year), scales="free_y")+ 
                     theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## Warning: Removed 34 rows containing missing values (`position_stack()`).

Prediction Model for AQI

library(dplyr)
library(tidyr)   # provides pivot_longer()
library(ggplot2)

# Calculate the proportion of missing values for each variable
miss_summary <- merged_aqi_weather %>%
                summarise(across(everything(), ~sum(is.na(.))/n())) %>%
                pivot_longer(everything(), names_to = "Variable", values_to = "MissingProportion")

# Plot the missingness summary for the entire dataset
ggplot(miss_summary, aes(x = Variable, y = MissingProportion)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

# Calculate the proportion of missing values for each variable, grouped by City
miss_summary_city <- merged_aqi_weather %>%
                     group_by(City) %>%
                     summarise(across(everything(), ~sum(is.na(.))/n())) %>%
                     pivot_longer(-City, names_to = "Variable", values_to = "MissingProportion")

# Plot the missingness summary by city
ggplot(miss_summary_city, aes(x = Variable, y = MissingProportion, fill = City)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
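The same missingness summary can be sketched in base R on a dataset that ships with R. The example below uses the built-in `airquality` data (daily New York measurements, not the project data) as a stand-in, since `merged_aqi_weather` is not reproduced here. It computes the overall proportion of missing values per variable and then the proportion for one variable within each group, analogous to the per-city breakdown above.

```r
# Built-in example data; a stand-in for merged_aqi_weather
data(airquality)

# Proportion of missing values for each variable
miss_prop <- colMeans(is.na(airquality))
print(round(sort(miss_prop, decreasing = TRUE), 3))

# Proportion of missing Ozone readings per month (grouped missingness)
miss_by_month <- aggregate(is.na(airquality$Ozone),
                           by = list(Month = airquality$Month),
                           FUN = mean)
names(miss_by_month)[2] <- "OzoneMissingProportion"
print(miss_by_month)
```

Variables or groups with a high missing proportion are candidates for imputation or exclusion before model fitting, because `lm()` drops incomplete rows by default.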

library(statisticalModeling)

# Randomly assign roughly half of the rows to the training set
set.seed(123) # For reproducibility
merged_aqi_weather$training_cases <- rnorm(nrow(merged_aqi_weather)) > 0

# Build base model AQI ~ tavg + prcp + Month + PM2.5 + PM10 + O3 with training cases
model1 <- lm(AQI ~ tavg + prcp + Month + PM2.5 + PM10 + O3, data = subset(merged_aqi_weather, training_cases))

# Evaluate the model for the testing cases
pred_model1 <- evaluate_model(model1, data = subset(merged_aqi_weather,!training_cases))

# Calculate the MSE on the testing data (an out-of-sample error, despite the variable name)
with(data = pred_model1, mean((AQI - model_output)^2, na.rm = TRUE)) ->in_sample_error1

testing_data<-subset(merged_aqi_weather, !training_cases)
plot_data1<-data.frame(predicted=pred_model1$model_output, actual=testing_data$AQI)

ggplot(data=plot_data1, aes(x=predicted, y=actual))+geom_point()+geom_abline(intercept = 0, slope =1, color ="red")
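The evaluation above can also be sketched with base R only, using `predict()` in place of `evaluate_model()`. The example below uses synthetic data in which AQI depends mainly on PM2.5. The data frame `toy`, its coefficients, and the 80/20 split are assumptions made for illustration and do not reproduce the project model.

```r
# Synthetic data (illustrative only): AQI driven mainly by PM2.5, slightly reduced by rain
set.seed(123)
n <- 400
toy <- data.frame(tavg  = rnorm(n, mean = 25, sd = 5),
                  prcp  = rexp(n, rate = 1 / 6),
                  PM2.5 = rlnorm(n, meanlog = 4, sdlog = 0.5))
toy$AQI <- 20 + 1.8 * toy$PM2.5 - 0.5 * toy$prcp + rnorm(n, sd = 15)

# 80/20 split into training and testing rows
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- lm(AQI ~ tavg + prcp + PM2.5, data = train)

# In-sample error uses the training residuals; out-of-sample error uses unseen rows
in_sample_mse     <- mean(residuals(fit)^2)
out_of_sample_mse <- mean((test$AQI - predict(fit, newdata = test))^2)

print(round(c(in_sample = in_sample_mse, out_of_sample = out_of_sample_mse), 1))
```

The out-of-sample value is the one that indicates how well the model generalises. A large gap between the two errors would point to overfitting.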

Final Conclusions and Observations

Air Quality Analysis

  • High AQI in Urban Centers: Cities like Ahmedabad, Delhi, Mumbai, Patna, and Bangalore have consistently high AQI values, indicating a significant level of air pollution.

  • Temporal Variations in AQI: AQI tends to be higher during winter months due to atmospheric conditions that trap pollutants. Conversely, it decreases during the rainy season when pollutants are washed away by rainfall.

  • Geographical Disparity in AQI: Northern Indian cities generally exhibit higher AQI values compared to those in the Southern region, likely due to differences in industrial activity, population density, and meteorological factors.

  • Weekday vs. Weekend AQI: No substantial difference in AQI was observed between weekdays and weekends, suggesting a persistent and consistent level of air pollution irrespective of the weekly cycle.

Temperature Trends

  • Stable Temperatures in Some Cities: Chennai, Rajasthan, Bangalore, and Mumbai show minor fluctuations in average annual temperatures.

  • Rising Temperatures in Delhi and Lucknow: These cities exhibit a noticeable and consistent increase in temperatures, potentially indicative of urban heat island effects and broader climate change impacts.

  • Higher Temperatures in Coastal Cities: Mumbai and Chennai, being coastal cities, consistently record higher average temperatures compared to non-coastal cities.

  • Warming Trend in Non-Coastal Cities: Non-coastal cities, especially Delhi and Lucknow, show a trend of increasing temperatures over recent years.

Precipitation Patterns

  • Increased Rainfall in Recent Decades: Most cities have experienced an increase in average annual precipitation since 2004, which may be linked to changing climate patterns.

  • Higher Precipitation in Coastal Regions: Coastal regions, particularly Mumbai, receive significantly more rainfall compared to non-coastal areas.

  • Mumbai’s Exceptional Rainfall: Among coastal cities, Mumbai stands out with substantially higher precipitation levels than Chennai.

  • Stable Precipitation in Non-Coastal Regions: Non-coastal regions do not show significant changes in average precipitation over the past 30 years.

Predictive Modeling

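  • Limited Predictive Power for Rainfall: The random forest explained only about 9.4% of the variance in Bangalore’s rainfall (test MAE of about 3.80), and the decision tree performed slightly worse (MAE of about 4.28, RMSE of about 7.66). These models are therefore better read as exploratory tools for the interplay between temperature, precipitation, and air pollutants than as reliable forecasts.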

  • Climate and Air Quality Interactions: The models underscore the impact of climatic factors on air quality, illustrating how weather patterns can influence pollutant dispersion and concentration.

  • Geospatial Insights: The inclusion of geographical data (latitude, longitude, elevation) provided nuanced insights into regional variations in climate and air quality.

Overall Insights

The comprehensive analysis combining air quality, temperature, and precipitation data reveals a multi-faceted picture of environmental conditions across Indian cities. The findings highlight the urgency of addressing urban pollution and the importance of monitoring climate trends to inform policy and urban planning. The observed trends and patterns in these environmental parameters are critical for understanding the broader impacts of urbanization and climate change on public health and ecosystems.
